Skip to content

fix: prevent WS-broadcast OOM crash under connection churn#1

Open
skialpine wants to merge 8 commits into
feat/scale-telemetryfrom
fix/ws-oom-broadcast
Open

fix: prevent WS-broadcast OOM crash under connection churn#1
skialpine wants to merge 8 commits into
feat/scale-telemetryfrom
fix/ws-oom-broadcast

Conversation

@skialpine
Copy link
Copy Markdown
Owner

Stacked on feat/scale-telemetry (PR decentespresso#57). Base will be retargeted to main once decentespresso#57 merges. The telemetry in decentespresso#57 (reset_reason, heap) is what made this diagnosable.

Root cause (from a captured + decoded panic backtrace)

Under sustained multi-client WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap collapses. The broadcast path then allocates an AsyncWebSocketMessage per client and operator new throws std::bad_alloc; Arduino-ESP32 builds -fno-exceptions, so the throw goes to std::terminate()abort() → reboot (reset_reason=panic). This is the "weight stops being collected under load" failure — not thermal (die temp was 33 °C).

Decoded stack:

operator new -> __cxa_throw -> std::terminate -> abort
AsyncWebSocketClient::_queueMessage   (AsyncWebSocket.cpp:490)
AsyncWebSocket::printfAll
sendWebsocketWeightAll                 (include/websocket.h, loop() 10 Hz broadcast)

The existing 15 KB heap watchdog can't prevent it: it has a 2 s debounce and defers reboot up to 60 s while BLE is connected, so the 10 Hz allocation bad_allocs long before it acts.

Fix

  • wsBroadcastHeapOk() heap-floor gate on every broadcast-to-all helper (sendWebsocketWeightAll, sendWebsocketStatusAll, button, power-off): when free heap is below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB watchdog) the frame is skipped, not allocated. Dropping a frame is invisible (next weight frame ≤500 ms away).
  • -D WS_MAX_QUEUED_MESSAGES=8 (lib default 32): bounds each client's outbound queue so a backed-up/half-open client can't hoard heap.
  • CLAUDE.md: documented the footgun (notes + troubleshooting table).

Verification (on hardware)

Re-ran the exact load that crashed the unpatched build — conn_churn --rst 8×8 + 10 Hz WS + mDNS, BT connected:

  • Free heap driven to 6436 bytes (old build crashed at ~4684).
  • Gate engaged: [ws] low heap 17736 < 25000 -> skip broadcast.
  • No abort, no reboot, weight stream uninterrupted (uptime continuous).

🤖 Generated with Claude Code

skialpine and others added 8 commits May 25, 2026 12:24
Diagnostics for the field "weight stops being collected under sustained load"
reports (suspected thermal). Adds to the WS status frame and serial logs:
- soc_temp_c / soc_temp_max_c: live + peak ESP32-S3 die temperature
  (temperatureRead()), sampled every 2s on the main loop.
- weight_stalled + stall_count + last_stall_ms + last_stall_temp_c: a watchdog
  in pureScale() that flags when the ADS1232 raw value is frozen/railed for >8s
  (a live cell dithers every sample), recording the die temp at the moment of
  the stall to correlate failures with heat. (This failure is not firmware-
  recoverable, so it's surfaced, not silently retried.)
- reset_reason: esp_reset_reason() captured at boot, so a brownout/panic/WDT
  reset is attributable instead of looking like a clean power-on.

Telemetry-only; no behavior change to the weight/WiFi/BLE paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nitor)

Drives USB 10Hz + WS 10Hz + HTTP/WS churn + mDNS (BT driven externally) and
polls the new temp/stall telemetry every ~60s, watching for the weight-stall
failure and the die temp at which it occurs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Review follow-ups on the telemetry watchdog:
- Skip the stall check while b_adc_recovery_active (the ADS1232 power-cycle
  freezes the raw value by design); re-seed the window on resume so a genuine
  signal-timeout recovery isn't miscounted as a railed/frozen stall.
- Check every 250 ms instead of every loop iteration -- the ADC only produces
  ~10 samples/s, so polling getDebugInfo() at full loop rate (with its sqrt +
  dataset passes) just burns CPU/heat, which is counterproductive on the chip
  we're trying to characterize.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
From the toolkit review of PR decentespresso#57:
- temperatureRead() NaN guard: don't poison g_socTempC/Max (NaN -> invalid JSON
  and a frozen peak since NaN compares false); keep last valid + log once.
- g_resetReason is now volatile (CLAUDE.md: cross-task globals read on the
  AsyncTCP status path); status frame casts it for printf.
- Expose adc_recovery_count in the status frame: a *perpetual* ADC recovery loop
  keeps re-seeding the stall window so weight_stalled may never trip -- the
  climbing recovery count makes that failure mode visible. i_adc_recovery_count
  is now volatile (newly read cross-task).
- reset_reason: numeric "unknown_<code>" fallback so unmapped IDF reset reasons
  (CPU_LOCKUP/USB/JTAG) stay attributable.
- Comment fixes: volatile cross-task rationale; stall-window re-seed wording +
  recovery-loop blind-spot note; last_stall_temp_c valid-only-if last_stall_ms.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- DURATION/IP/HOST now positional args; warn (don't silently skip) if no USB port.
- Telemetry monitor logs reset_reason + adc_recovery_count per line, waits a full
  status interval after (re)connect, tracks peak temp / stalls / recoveries /
  reboots across the whole run (so a firmware reset doesn't lose the peak), and
  prints a SUMMARY line with a PASS/FAIL verdict.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addresses the iteration-2 review findings on PR decentespresso#57:

- Status frame no longer reads the multi-field stopWatch object directly off
  the AsyncTCP task (CLAUDE.md-forbidden cross-task tear, pre-existing). The
  loop task now snapshots it into aligned volatiles (g_timerRunning/
  g_timerElapsed) that both status frames read.
- Widen i_adc_recovery_count uint8_t -> uint32_t and drop the <255 cap so a
  perpetual-recovery loop (the blind spot the stall watchdog can't see) keeps
  counting truthfully over a long soak instead of saturating; update the WS
  format specifier %u -> %lu accordingly.
- SoC temp guard: isfinite() instead of !isnan() so +/-inf can't reach the JSON.
- Stall watchdog: never store 0 as the t_rawChange timestamp (it is the reseed
  sentinel) at boot/rollover.
- README: document the new status-frame telemetry fields.
- thermal_load_test.sh: FAIL (not silent PASS) on sustained loss of status
  frames or a crashed load generator, and exit non-zero on FAIL so it works as
  a CI gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- thermal_load_test.sh: close the silent-PASS holes a review found. A
  flapping wedge (scale answers one frame per reconnect, resetting the
  consecutive-miss streak) now fails via a cumulative total_no_status counter,
  not just max_no_status_streak. Each load generator's PID is captured and
  waited on individually so a never-started/crashed driver (non-zero exit)
  fails the run instead of being missed by a Traceback-only grep. A run that
  never saw soc_temp_max_c (peak stuck at the -999 sentinel) also fails, since
  the thermal data the test exists to capture is absent.
- CLAUDE.md: add "Fixing bugs you find along the way" — pre-existing bugs get
  fixed in the same change, not deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root-caused from a captured panic backtrace: under sustained multi-client
WiFi load (WS connection churn + the 10 Hz weight broadcast), free heap
collapses and AsyncWebSocket's printfAll path allocates an
AsyncWebSocketMessage per client -> operator new throws std::bad_alloc ->
(Arduino-ESP32 is -fno-exceptions) std::terminate() -> abort() -> reboot.
That OOM-reboot is the "weight stops being collected under load" failure
(not thermal -- die temp was 33 C). Decoded stack:

  operator new -> __cxa_throw -> std::terminate -> abort
  AsyncWebSocketClient::_queueMessage (AsyncWebSocket.cpp:490)
  AsyncWebSocket::printfAll
  sendWebsocketWeightAll (websocket.h)  <- loop() 10 Hz broadcast

Fix:
- Heap-gate every broadcast-to-all helper (weight, status, button,
  power-off) with wsBroadcastHeapOk(): skip the frame when free heap is
  below WS_BROADCAST_HEAP_FLOOR (25 KB, above the 15 KB heap watchdog)
  instead of allocating into an exhausted heap and crashing. Dropping a
  frame is invisible; the next is <=500 ms away.
- Cap each client's outbound queue via -D WS_MAX_QUEUED_MESSAGES=8 (lib
  default 32) so a backed-up/half-open client can't hoard heap.
- Document the footgun in CLAUDE.md (notes + troubleshooting table).

Stacked on the scale-telemetry branch (PR decentespresso#57) whose reset_reason / heap
telemetry made this diagnosable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@skialpine skialpine force-pushed the feat/scale-telemetry branch from f3adb84 to 62a09f9 Compare May 27, 2026 14:00
skialpine added a commit that referenced this pull request May 27, 2026
Three findings from the code review on cb569d3:

1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame
   is invisible (the next weight frame is <=500 ms away, status <=5 s);
   crashing is not." The "status <=5 s" recovery guarantee no longer
   holds after this PR removes the only periodic status broadcaster.
   Rewrote to describe what's actually still gated by wsBroadcastHeapOk
   (weight + button + power_off).

2. include/websocket.h:243-247 had an addendum paragraph framing the
   surviving printfAll comment as "Note: status is no longer broadcast
   periodically...". The historical framing reads as documentation for
   a function that no longer exists; dropping it leaves the printfAll
   technical commentary applying cleanly to sendWebsocketWeightAll
   (which is what it now sits above).

3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch
   directly from the AsyncTCP task. CLAUDE.md threading model #1
   footgun: "stopWatch.* -- No -- multi-field (running flag + start ts +
   accumulator) and also mutated from loop(), BLE and USB; a status-
   frame read can tear a write across tasks." The bug is pre-existing
   on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you
   find along the way" applies. Mirroring the snapshot pattern used by
   PR decentespresso#57: new g_timerRunning/g_timerElapsed volatiles in parameter.h,
   refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping
   block, read by sendWebsocketStatus. (PR decentespresso#57 adds the same snapshot
   independently; merge order between decentespresso#57 and decentespresso#62 will produce a trivial
   same-lines conflict that git auto-resolves.)

Verified on hardware:
  flash + 12s window with events on + rate 10k -> 121 weight frames at
  10 Hz, 0 periodic status frames; on-request {"command":"status"}
  returns the full 16-field frame with timer_running/timer_seconds
  populated from the snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tadelv pushed a commit that referenced this pull request May 28, 2026
Three findings from the code review on cb569d3:

1. include/websocket.h:166 heap-gate doc-comment claimed "Skipping a frame
   is invisible (the next weight frame is <=500 ms away, status <=5 s);
   crashing is not." The "status <=5 s" recovery guarantee no longer
   holds after this PR removes the only periodic status broadcaster.
   Rewrote to describe what's actually still gated by wsBroadcastHeapOk
   (weight + button + power_off).

2. include/websocket.h:243-247 had an addendum paragraph framing the
   surviving printfAll comment as "Note: status is no longer broadcast
   periodically...". The historical framing reads as documentation for
   a function that no longer exists; dropping it leaves the printfAll
   technical commentary applying cleanly to sendWebsocketWeightAll
   (which is what it now sits above).

3. sendWebsocketStatus (include/websocket.h:224-225) reads stopWatch
   directly from the AsyncTCP task. CLAUDE.md threading model #1
   footgun: "stopWatch.* -- No -- multi-field (running flag + start ts +
   accumulator) and also mutated from loop(), BLE and USB; a status-
   frame read can tear a write across tasks." The bug is pre-existing
   on main (not introduced by this PR), but CLAUDE.md's "Fixing bugs you
   find along the way" applies. Mirroring the snapshot pattern used by
   PR decentespresso#57: new g_timerRunning/g_timerElapsed volatiles in parameter.h,
   refreshed once per main-loop pass in src/hds.ino's WiFi-housekeeping
   block, read by sendWebsocketStatus. (PR decentespresso#57 adds the same snapshot
   independently; merge order between decentespresso#57 and decentespresso#62 will produce a trivial
   same-lines conflict that git auto-resolves.)

Verified on hardware:
  flash + 12s window with events on + rate 10k -> 121 weight frames at
  10 Hz, 0 periodic status frames; on-request {"command":"status"}
  returns the full 16-field frame with timer_running/timer_seconds
  populated from the snapshot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant